[X86] Reduce znver3/4 LoopMicroOpBufferSize to practical loop unrolling values #91340
@llvm/pr-subscribers-backend-x86 @llvm/pr-subscribers-llvm-transforms
Author: Simon Pilgrim (RKSimon)
Changes
The znver3/4 scheduler models have previously associated the LoopMicroOpBufferSize with the maximum size of their op caches, and when this led to quadratic complexity issues it was reduced to a value of 512 uops, based mainly on compilation time rather than its effect on runtime performance.
From a runtime performance POV, a large LoopMicroOpBufferSize leads to a higher number of loop unrolls, meaning the cpu has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we can make use of the faster op cache rate (8/9 ops/cy).
This patch proposes we instead cap the LoopMicroOpBufferSize based on the maximum rate from the op cache (znver3 = 8 ops/cy, znver4 = 9 ops/cy) and the branch misprediction penalty from the op cache (~12cy), as an estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but it does try to be closer to the spirit of how we use LoopMicroOpBufferSize in the compiler than the size of a similarly named buffer in the cpu.
Patch is 83.40 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91340.diff
3 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ScheduleZnver3.td b/llvm/lib/Target/X86/X86ScheduleZnver3.td
index 2e87d5262818c..cbf1de8408798 100644
--- a/llvm/lib/Target/X86/X86ScheduleZnver3.td
+++ b/llvm/lib/Target/X86/X86ScheduleZnver3.td
@@ -33,13 +33,10 @@ def Znver3Model : SchedMachineModel {
// The op cache is organized as an associative cache with 64 sets and 8 ways.
// At each set-way intersection is an entry containing up to 8 macro ops.
// The maximum capacity of the op cache is 4K ops.
- // Agner, 22.5 µop cache
- // The size of the µop cache is big enough for holding most critical loops.
- // FIXME: PR50584: MachineScheduler/PostRAScheduler have quadradic complexity,
- // with large values here the compilation of certain loops
- // ends up taking way too long.
- // let LoopMicroOpBufferSize = 4096;
- let LoopMicroOpBufferSize = 512;
+ // Assuming a maximum dispatch of 8 ops/cy and a mispredict cost of 12cy from
+ // the op-cache, we limit the loop buffer to 8*12 = 96 to avoid loop unrolling
+ // leading to excessive filling of the op-cache from frontend.
+ let LoopMicroOpBufferSize = 96;
// AMD SOG 19h, 2.6.2 L1 Data Cache
// The L1 data cache has a 4- or 5- cycle integer load-to-use latency.
// AMD SOG 19h, 2.12 L1 Data Cache
diff --git a/llvm/lib/Target/X86/X86ScheduleZnver4.td b/llvm/lib/Target/X86/X86ScheduleZnver4.td
index dac4d8422582a..7107dbc63e279 100644
--- a/llvm/lib/Target/X86/X86ScheduleZnver4.td
+++ b/llvm/lib/Target/X86/X86ScheduleZnver4.td
@@ -28,17 +28,11 @@ def Znver4Model : SchedMachineModel {
// AMD SOG 19h, 2.9.1 Op Cache
// The op cache is organized as an associative cache with 64 sets and 8 ways.
// At each set-way intersection is an entry containing up to 8 macro ops.
- // The maximum capacity of the op cache is 4K ops.
- // Agner, 22.5 µop cache
- // The size of the µop cache is big enough for holding most critical loops.
- // FIXME: PR50584: MachineScheduler/PostRAScheduler have quadradic complexity,
- // with large values here the compilation of certain loops
- // ends up taking way too long.
- // Ideally for znver4, we should have 6.75K. However we don't add that
- // considerting the impact compile time and prefer using default values
- // instead.
- // Retaining minimal value to influence unrolling as we did for znver3.
- let LoopMicroOpBufferSize = 512;
+ // The maximum capacity of the op cache is 6.75K ops.
+ // Assuming a maximum dispatch of 9 ops/cy and a mispredict cost of 12cy from
+ // the op-cache, we limit the loop buffer to 9*12 = 108 to avoid loop
+ // unrolling leading to excessive filling of the op-cache from frontend.
+ let LoopMicroOpBufferSize = 108;
// AMD SOG 19h, 2.6.2 L1 Data Cache
// The L1 data cache has a 4- or 5- cycle integer load-to-use latency.
// AMD SOG 19h, 2.12 L1 Data Cache
diff --git a/llvm/test/Transforms/LoopUnroll/X86/znver3.ll b/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
index 30389062a0967..b1f1d7d814e6c 100644
--- a/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
+++ b/llvm/test/Transforms/LoopUnroll/X86/znver3.ll
@@ -73,456 +73,8 @@ define i32 @test(ptr %ary) "target-cpu"="znver3" {
; CHECK-NEXT: [[INDVARS_IV_NEXT_14:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 15
; CHECK-NEXT: [[ARRAYIDX_15:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_14]]
; CHECK-NEXT: [[VAL_15:%.*]] = load i32, ptr [[ARRAYIDX_15]], align 4
-; CHECK-NEXT: [[SUM_NEXT_15:%.*]] = add nsw i32 [[VAL_15]], [[SUM_NEXT_14]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_15:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 16
-; CHECK-NEXT: [[ARRAYIDX_16:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_15]]
-; CHECK-NEXT: [[VAL_16:%.*]] = load i32, ptr [[ARRAYIDX_16]], align 4
-; CHECK-NEXT: [[SUM_NEXT_16:%.*]] = add nsw i32 [[VAL_16]], [[SUM_NEXT_15]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_16:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 17
-; CHECK-NEXT: [[ARRAYIDX_17:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_16]]
-; CHECK-NEXT: [[VAL_17:%.*]] = load i32, ptr [[ARRAYIDX_17]], align 4
-; CHECK-NEXT: [[SUM_NEXT_17:%.*]] = add nsw i32 [[VAL_17]], [[SUM_NEXT_16]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_17:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 18
-; CHECK-NEXT: [[ARRAYIDX_18:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_17]]
-; CHECK-NEXT: [[VAL_18:%.*]] = load i32, ptr [[ARRAYIDX_18]], align 4
-; CHECK-NEXT: [[SUM_NEXT_18:%.*]] = add nsw i32 [[VAL_18]], [[SUM_NEXT_17]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_18:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 19
-; CHECK-NEXT: [[ARRAYIDX_19:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_18]]
-; CHECK-NEXT: [[VAL_19:%.*]] = load i32, ptr [[ARRAYIDX_19]], align 4
-; CHECK-NEXT: [[SUM_NEXT_19:%.*]] = add nsw i32 [[VAL_19]], [[SUM_NEXT_18]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_19:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 20
-; CHECK-NEXT: [[ARRAYIDX_20:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_19]]
-; CHECK-NEXT: [[VAL_20:%.*]] = load i32, ptr [[ARRAYIDX_20]], align 4
-; CHECK-NEXT: [[SUM_NEXT_20:%.*]] = add nsw i32 [[VAL_20]], [[SUM_NEXT_19]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_20:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 21
-; CHECK-NEXT: [[ARRAYIDX_21:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_20]]
-; CHECK-NEXT: [[VAL_21:%.*]] = load i32, ptr [[ARRAYIDX_21]], align 4
-; CHECK-NEXT: [[SUM_NEXT_21:%.*]] = add nsw i32 [[VAL_21]], [[SUM_NEXT_20]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 22
-; CHECK-NEXT: [[ARRAYIDX_22:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_21]]
-; CHECK-NEXT: [[VAL_22:%.*]] = load i32, ptr [[ARRAYIDX_22]], align 4
-; CHECK-NEXT: [[SUM_NEXT_22:%.*]] = add nsw i32 [[VAL_22]], [[SUM_NEXT_21]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_22:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 23
-; CHECK-NEXT: [[ARRAYIDX_23:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_22]]
-; CHECK-NEXT: [[VAL_23:%.*]] = load i32, ptr [[ARRAYIDX_23]], align 4
-; CHECK-NEXT: [[SUM_NEXT_23:%.*]] = add nsw i32 [[VAL_23]], [[SUM_NEXT_22]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_23:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 24
-; CHECK-NEXT: [[ARRAYIDX_24:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_23]]
-; CHECK-NEXT: [[VAL_24:%.*]] = load i32, ptr [[ARRAYIDX_24]], align 4
-; CHECK-NEXT: [[SUM_NEXT_24:%.*]] = add nsw i32 [[VAL_24]], [[SUM_NEXT_23]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_24:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 25
-; CHECK-NEXT: [[ARRAYIDX_25:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_24]]
-; CHECK-NEXT: [[VAL_25:%.*]] = load i32, ptr [[ARRAYIDX_25]], align 4
-; CHECK-NEXT: [[SUM_NEXT_25:%.*]] = add nsw i32 [[VAL_25]], [[SUM_NEXT_24]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_25:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 26
-; CHECK-NEXT: [[ARRAYIDX_26:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_25]]
-; CHECK-NEXT: [[VAL_26:%.*]] = load i32, ptr [[ARRAYIDX_26]], align 4
-; CHECK-NEXT: [[SUM_NEXT_26:%.*]] = add nsw i32 [[VAL_26]], [[SUM_NEXT_25]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_26:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 27
-; CHECK-NEXT: [[ARRAYIDX_27:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_26]]
-; CHECK-NEXT: [[VAL_27:%.*]] = load i32, ptr [[ARRAYIDX_27]], align 4
-; CHECK-NEXT: [[SUM_NEXT_27:%.*]] = add nsw i32 [[VAL_27]], [[SUM_NEXT_26]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_27:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 28
-; CHECK-NEXT: [[ARRAYIDX_28:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_27]]
-; CHECK-NEXT: [[VAL_28:%.*]] = load i32, ptr [[ARRAYIDX_28]], align 4
-; CHECK-NEXT: [[SUM_NEXT_28:%.*]] = add nsw i32 [[VAL_28]], [[SUM_NEXT_27]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_28:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 29
-; CHECK-NEXT: [[ARRAYIDX_29:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_28]]
-; CHECK-NEXT: [[VAL_29:%.*]] = load i32, ptr [[ARRAYIDX_29]], align 4
-; CHECK-NEXT: [[SUM_NEXT_29:%.*]] = add nsw i32 [[VAL_29]], [[SUM_NEXT_28]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_29:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 30
-; CHECK-NEXT: [[ARRAYIDX_30:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_29]]
-; CHECK-NEXT: [[VAL_30:%.*]] = load i32, ptr [[ARRAYIDX_30]], align 4
-; CHECK-NEXT: [[SUM_NEXT_30:%.*]] = add nsw i32 [[VAL_30]], [[SUM_NEXT_29]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_30:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 31
-; CHECK-NEXT: [[ARRAYIDX_31:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_30]]
-; CHECK-NEXT: [[VAL_31:%.*]] = load i32, ptr [[ARRAYIDX_31]], align 4
-; CHECK-NEXT: [[SUM_NEXT_31:%.*]] = add nsw i32 [[VAL_31]], [[SUM_NEXT_30]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_31:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 32
-; CHECK-NEXT: [[ARRAYIDX_32:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_31]]
-; CHECK-NEXT: [[VAL_32:%.*]] = load i32, ptr [[ARRAYIDX_32]], align 4
-; CHECK-NEXT: [[SUM_NEXT_32:%.*]] = add nsw i32 [[VAL_32]], [[SUM_NEXT_31]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_32:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 33
-; CHECK-NEXT: [[ARRAYIDX_33:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_32]]
-; CHECK-NEXT: [[VAL_33:%.*]] = load i32, ptr [[ARRAYIDX_33]], align 4
-; CHECK-NEXT: [[SUM_NEXT_33:%.*]] = add nsw i32 [[VAL_33]], [[SUM_NEXT_32]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_33:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 34
-; CHECK-NEXT: [[ARRAYIDX_34:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_33]]
-; CHECK-NEXT: [[VAL_34:%.*]] = load i32, ptr [[ARRAYIDX_34]], align 4
-; CHECK-NEXT: [[SUM_NEXT_34:%.*]] = add nsw i32 [[VAL_34]], [[SUM_NEXT_33]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_34:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 35
-; CHECK-NEXT: [[ARRAYIDX_35:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_34]]
-; CHECK-NEXT: [[VAL_35:%.*]] = load i32, ptr [[ARRAYIDX_35]], align 4
-; CHECK-NEXT: [[SUM_NEXT_35:%.*]] = add nsw i32 [[VAL_35]], [[SUM_NEXT_34]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_35:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 36
-; CHECK-NEXT: [[ARRAYIDX_36:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_35]]
-; CHECK-NEXT: [[VAL_36:%.*]] = load i32, ptr [[ARRAYIDX_36]], align 4
-; CHECK-NEXT: [[SUM_NEXT_36:%.*]] = add nsw i32 [[VAL_36]], [[SUM_NEXT_35]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_36:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 37
-; CHECK-NEXT: [[ARRAYIDX_37:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_36]]
-; CHECK-NEXT: [[VAL_37:%.*]] = load i32, ptr [[ARRAYIDX_37]], align 4
-; CHECK-NEXT: [[SUM_NEXT_37:%.*]] = add nsw i32 [[VAL_37]], [[SUM_NEXT_36]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_37:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 38
-; CHECK-NEXT: [[ARRAYIDX_38:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_37]]
-; CHECK-NEXT: [[VAL_38:%.*]] = load i32, ptr [[ARRAYIDX_38]], align 4
-; CHECK-NEXT: [[SUM_NEXT_38:%.*]] = add nsw i32 [[VAL_38]], [[SUM_NEXT_37]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_38:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 39
-; CHECK-NEXT: [[ARRAYIDX_39:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_38]]
-; CHECK-NEXT: [[VAL_39:%.*]] = load i32, ptr [[ARRAYIDX_39]], align 4
-; CHECK-NEXT: [[SUM_NEXT_39:%.*]] = add nsw i32 [[VAL_39]], [[SUM_NEXT_38]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_39:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 40
-; CHECK-NEXT: [[ARRAYIDX_40:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_39]]
-; CHECK-NEXT: [[VAL_40:%.*]] = load i32, ptr [[ARRAYIDX_40]], align 4
-; CHECK-NEXT: [[SUM_NEXT_40:%.*]] = add nsw i32 [[VAL_40]], [[SUM_NEXT_39]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_40:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 41
-; CHECK-NEXT: [[ARRAYIDX_41:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_40]]
-; CHECK-NEXT: [[VAL_41:%.*]] = load i32, ptr [[ARRAYIDX_41]], align 4
-; CHECK-NEXT: [[SUM_NEXT_41:%.*]] = add nsw i32 [[VAL_41]], [[SUM_NEXT_40]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_41:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 42
-; CHECK-NEXT: [[ARRAYIDX_42:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_41]]
-; CHECK-NEXT: [[VAL_42:%.*]] = load i32, ptr [[ARRAYIDX_42]], align 4
-; CHECK-NEXT: [[SUM_NEXT_42:%.*]] = add nsw i32 [[VAL_42]], [[SUM_NEXT_41]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_42:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 43
-; CHECK-NEXT: [[ARRAYIDX_43:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_42]]
-; CHECK-NEXT: [[VAL_43:%.*]] = load i32, ptr [[ARRAYIDX_43]], align 4
-; CHECK-NEXT: [[SUM_NEXT_43:%.*]] = add nsw i32 [[VAL_43]], [[SUM_NEXT_42]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_43:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 44
-; CHECK-NEXT: [[ARRAYIDX_44:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_43]]
-; CHECK-NEXT: [[VAL_44:%.*]] = load i32, ptr [[ARRAYIDX_44]], align 4
-; CHECK-NEXT: [[SUM_NEXT_44:%.*]] = add nsw i32 [[VAL_44]], [[SUM_NEXT_43]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_44:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 45
-; CHECK-NEXT: [[ARRAYIDX_45:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_44]]
-; CHECK-NEXT: [[VAL_45:%.*]] = load i32, ptr [[ARRAYIDX_45]], align 4
-; CHECK-NEXT: [[SUM_NEXT_45:%.*]] = add nsw i32 [[VAL_45]], [[SUM_NEXT_44]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_45:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 46
-; CHECK-NEXT: [[ARRAYIDX_46:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_45]]
-; CHECK-NEXT: [[VAL_46:%.*]] = load i32, ptr [[ARRAYIDX_46]], align 4
-; CHECK-NEXT: [[SUM_NEXT_46:%.*]] = add nsw i32 [[VAL_46]], [[SUM_NEXT_45]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_46:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 47
-; CHECK-NEXT: [[ARRAYIDX_47:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_46]]
-; CHECK-NEXT: [[VAL_47:%.*]] = load i32, ptr [[ARRAYIDX_47]], align 4
-; CHECK-NEXT: [[SUM_NEXT_47:%.*]] = add nsw i32 [[VAL_47]], [[SUM_NEXT_46]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_47:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 48
-; CHECK-NEXT: [[ARRAYIDX_48:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_47]]
-; CHECK-NEXT: [[VAL_48:%.*]] = load i32, ptr [[ARRAYIDX_48]], align 4
-; CHECK-NEXT: [[SUM_NEXT_48:%.*]] = add nsw i32 [[VAL_48]], [[SUM_NEXT_47]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_48:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 49
-; CHECK-NEXT: [[ARRAYIDX_49:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_48]]
-; CHECK-NEXT: [[VAL_49:%.*]] = load i32, ptr [[ARRAYIDX_49]], align 4
-; CHECK-NEXT: [[SUM_NEXT_49:%.*]] = add nsw i32 [[VAL_49]], [[SUM_NEXT_48]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_49:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 50
-; CHECK-NEXT: [[ARRAYIDX_50:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_49]]
-; CHECK-NEXT: [[VAL_50:%.*]] = load i32, ptr [[ARRAYIDX_50]], align 4
-; CHECK-NEXT: [[SUM_NEXT_50:%.*]] = add nsw i32 [[VAL_50]], [[SUM_NEXT_49]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_50:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 51
-; CHECK-NEXT: [[ARRAYIDX_51:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_50]]
-; CHECK-NEXT: [[VAL_51:%.*]] = load i32, ptr [[ARRAYIDX_51]], align 4
-; CHECK-NEXT: [[SUM_NEXT_51:%.*]] = add nsw i32 [[VAL_51]], [[SUM_NEXT_50]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_51:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 52
-; CHECK-NEXT: [[ARRAYIDX_52:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_51]]
-; CHECK-NEXT: [[VAL_52:%.*]] = load i32, ptr [[ARRAYIDX_52]], align 4
-; CHECK-NEXT: [[SUM_NEXT_52:%.*]] = add nsw i32 [[VAL_52]], [[SUM_NEXT_51]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_52:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 53
-; CHECK-NEXT: [[ARRAYIDX_53:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_52]]
-; CHECK-NEXT: [[VAL_53:%.*]] = load i32, ptr [[ARRAYIDX_53]], align 4
-; CHECK-NEXT: [[SUM_NEXT_53:%.*]] = add nsw i32 [[VAL_53]], [[SUM_NEXT_52]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_53:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 54
-; CHECK-NEXT: [[ARRAYIDX_54:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_53]]
-; CHECK-NEXT: [[VAL_54:%.*]] = load i32, ptr [[ARRAYIDX_54]], align 4
-; CHECK-NEXT: [[SUM_NEXT_54:%.*]] = add nsw i32 [[VAL_54]], [[SUM_NEXT_53]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_54:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 55
-; CHECK-NEXT: [[ARRAYIDX_55:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_54]]
-; CHECK-NEXT: [[VAL_55:%.*]] = load i32, ptr [[ARRAYIDX_55]], align 4
-; CHECK-NEXT: [[SUM_NEXT_55:%.*]] = add nsw i32 [[VAL_55]], [[SUM_NEXT_54]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_55:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 56
-; CHECK-NEXT: [[ARRAYIDX_56:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_55]]
-; CHECK-NEXT: [[VAL_56:%.*]] = load i32, ptr [[ARRAYIDX_56]], align 4
-; CHECK-NEXT: [[SUM_NEXT_56:%.*]] = add nsw i32 [[VAL_56]], [[SUM_NEXT_55]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_56:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 57
-; CHECK-NEXT: [[ARRAYIDX_57:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_56]]
-; CHECK-NEXT: [[VAL_57:%.*]] = load i32, ptr [[ARRAYIDX_57]], align 4
-; CHECK-NEXT: [[SUM_NEXT_57:%.*]] = add nsw i32 [[VAL_57]], [[SUM_NEXT_56]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_57:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 58
-; CHECK-NEXT: [[ARRAYIDX_58:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_57]]
-; CHECK-NEXT: [[VAL_58:%.*]] = load i32, ptr [[ARRAYIDX_58]], align 4
-; CHECK-NEXT: [[SUM_NEXT_58:%.*]] = add nsw i32 [[VAL_58]], [[SUM_NEXT_57]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_58:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 59
-; CHECK-NEXT: [[ARRAYIDX_59:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_58]]
-; CHECK-NEXT: [[VAL_59:%.*]] = load i32, ptr [[ARRAYIDX_59]], align 4
-; CHECK-NEXT: [[SUM_NEXT_59:%.*]] = add nsw i32 [[VAL_59]], [[SUM_NEXT_58]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_59:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 60
-; CHECK-NEXT: [[ARRAYIDX_60:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_59]]
-; CHECK-NEXT: [[VAL_60:%.*]] = load i32, ptr [[ARRAYIDX_60]], align 4
-; CHECK-NEXT: [[SUM_NEXT_60:%.*]] = add nsw i32 [[VAL_60]], [[SUM_NEXT_59]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_60:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 61
-; CHECK-NEXT: [[ARRAYIDX_61:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_60]]
-; CHECK-NEXT: [[VAL_61:%.*]] = load i32, ptr [[ARRAYIDX_61]], align 4
-; CHECK-NEXT: [[SUM_NEXT_61:%.*]] = add nsw i32 [[VAL_61]], [[SUM_NEXT_60]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_61:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 62
-; CHECK-NEXT: [[ARRAYIDX_62:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_61]]
-; CHECK-NEXT: [[VAL_62:%.*]] = load i32, ptr [[ARRAYIDX_62]], align 4
-; CHECK-NEXT: [[SUM_NEXT_62:%.*]] = add nsw i32 [[VAL_62]], [[SUM_NEXT_61]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_62:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 63
-; CHECK-NEXT: [[ARRAYIDX_63:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_62]]
-; CHECK-NEXT: [[VAL_63:%.*]] = load i32, ptr [[ARRAYIDX_63]], align 4
-; CHECK-NEXT: [[SUM_NEXT_63:%.*]] = add nsw i32 [[VAL_63]], [[SUM_NEXT_62]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_63:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 64
-; CHECK-NEXT: [[ARRAYIDX_64:%.*]] = getelementptr inbounds i32, ptr [[ARY]], i64 [[INDVARS_IV_NEXT_63]]
-; CHECK-NEXT: [[VAL_64:%.*]] = load i32, ptr [[ARRAYIDX_64]], align 4
-; CHECK-NEXT: [[SUM_NEXT_64:%.*]] = add nsw i32 [[VAL_64]], [[SUM_NEXT_63]]
-; CHECK-NEXT: [[INDVARS_IV_NEXT_64:%.*]] = add nuw nsw i64 [[INDVARS_IV...
[truncated]
I didn't really get why it makes sense to multiply the uop fetch rate by the mispredict penalty. Also, while znver4 might in theory support 9 uops per cycle, isn't it limited by the 6-wide renamer in practice?
I am not sure why you mention the renamer. It plays a role in throughput, but throughput is not strictly limited to the renamer's capacity, right? What the renamer can sustain varies case by case, but the theoretical limit is still 9 uops. Also, the renamer can be split across integer/fp (vector). So I don't think we should restrict the value based on the renamer's capability. @RKSimon I think we should add a tuning flag for whether a subtarget is willing to use this LoopMicroOpBufferSize for unrolling decisions. I agree that the metric you are proposing serves the purpose, but the term LoopMicroOpBufferSize is itself misleading and not representative.
We shouldn't need a TLI/TTI control for this - either removing the LoopMicroOpBufferSize entry (see znver1/2) or explicitly setting it to 0 has a similar effect. But I'm not certain whether we want to do this for znver3/4 - I don't have access to hardware to test this.
Would people prefer we just drop the LoopMicroOpBufferSize entry from the znver3/4 models (same as znver1/2)? This prevents most loop unrolling, and we then rely on the cpu's higher op cache rate to get higher performance (but we end up testing the loop condition on every iteration).
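The effect of a zero or missing entry, as described in the two comments above, can be pictured with a small model. This is only a sketch under stated assumptions: the function names are hypothetical, the body size of 5 ops is made up, and the power-of-two rounding is a simplification rather than a statement of what LLVM's unroller actually does.

```python
def partial_unroll_budget(loop_micro_op_buffer_size=0):
    """Return the size budget for partial unrolling, or None when disabled.

    Assumption: a missing entry behaves like 0, and 0 gives the unroller
    no budget at all (the znver1/2 situation mentioned in the thread).
    """
    if not loop_micro_op_buffer_size:
        return None
    return loop_micro_op_buffer_size


def max_unroll_count(body_ops, loop_micro_op_buffer_size=0):
    """Largest power-of-two unroll count whose unrolled body fits the budget.

    The power-of-two restriction is an illustrative simplification.
    """
    budget = partial_unroll_budget(loop_micro_op_buffer_size)
    if budget is None or body_ops <= 0 or body_ops > budget:
        return 1  # leave the loop rolled
    count = 1
    while count * 2 * body_ops <= budget:
        count *= 2
    return count


if __name__ == "__main__":
    # Hypothetical loop body of 5 ops under the old and new settings.
    for size in (0, 96, 108, 512):
        print(size, max_unroll_count(5, size))
```

Under this model a smaller LoopMicroOpBufferSize lowers the unroll count, and a value of 0 leaves the loop rolled.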
I have at least one counterexample which gains with the LoopMicroOpBufferSize setting we have for znver3/4. Let us go with your derivation of the LoopMicroOpBufferSize metric based on the misprediction penalty.
matrix_vector_mul.zip
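The metric being accepted here can be worked through numerically. The sketch below uses only the figures quoted in the PR description (8 or 9 ops/cy from the op cache, a ~12cy mispredict penalty, 4 ins/cy from the frontend decoders). The function names are hypothetical, and the first-pass estimate assumes one op per instruction, which will not hold for every instruction.

```python
import math

# Figures quoted in the PR description.
OP_CACHE_RATE = {"znver3": 8, "znver4": 9}  # ops per cycle from the op cache
MISPREDICT_PENALTY_CYCLES = 12              # approximate penalty from the op cache
FRONTEND_DECODE_RATE = 4                    # instructions per cycle, maximum


def loop_micro_op_buffer_cap(cpu):
    """Ops the op cache could deliver during one mispredict penalty."""
    return OP_CACHE_RATE[cpu] * MISPREDICT_PENALTY_CYCLES


def first_pass_decode_cycles(ops, decode_rate=FRONTEND_DECODE_RATE):
    """Cycles to decode an unrolled body once, before the op cache can serve it.

    Assumption: one op per instruction, decoded at the maximum rate.
    """
    return math.ceil(ops / decode_rate)


if __name__ == "__main__":
    for cpu in ("znver3", "znver4"):
        cap = loop_micro_op_buffer_cap(cpu)
        print(cpu, cap, first_pass_decode_cycles(cap))
    # The previous setting of 512 ops, for comparison.
    print("old", 512, first_pass_decode_cycles(512))
```

With these inputs the caps come out as 96 and 108, matching the values in the patch, and the first decode pass over a body at the cap is several times shorter than for a body at the previous 512-op setting.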
Thanks @ganeshgit - are you happy to accept this patch as it is then?
@RKSimon Sure thanks a lot! LGTM!